Take Home Exercise 3

library(devtools)
devtools::install_github("davzoku/onemapsgapi")
1 Assignment Task
2 Overview
HDB flats play a central role in Singapore’s housing landscape, providing homes for the majority of its population. According to the Census of Population 2020, 87.9% of residents own their homes, and 78.7% of households live in HDB flats. This prompts an essential question for every resident: What drives HDB resale prices? Could it be the flat’s location and proximity to essential services, or could it be more influenced by factors like flat type and size?
3 Objective
This project is especially meaningful to me as an aspiring homeowner interested in owning a 5-room HDB flat. My goal is to apply what I’ve learned in this module by sourcing, curating, and cleaning both geospatial and aspatial data from various sources, including data.gov.sg, OneMap, LTA Data Mall, and HDB, to build a comprehensive dataset for predictive modeling.
Using resale data for 5-room HDB flats from January to December 2023, I will train and compare three models: an OLS Multiple Linear Regression Model, a Random Forest and a Geographically Weighted Random Forest (gwRF) Model. These models will be used to predict 5-room HDB resale prices in Singapore for July to September 2024.
4 Installing and Launching the R Packages
The following R packages will be used in this exercise:
| Package | Purpose | Use Case in Exercise |
|---|---|---|
| dotenv | Manages environment variables | Stores and retrieves API tokens securely for OneMap API access. |
| sf | Handles spatial data; imports, manages, and processes geospatial data | Manages geospatial data, such as Singapore’s boundary data and geocoded resale data. |
| onemapsgapi | Accesses OneMap API for geospatial data | Retrieves geospatial data like community clubs, libraries, and parks for analysis. |
| httr | Facilitates HTTP requests and API interactions | Sends API requests to OneMap to retrieve geospatial data for specified themes. |
| tidyverse | Provides tools for data manipulation and visualization | Cleans and processes datasets, such as merging geospatial features and handling missing values. |
| xml2 | Parses and processes XML data | Reads KML files for features like hawker centers and supermarkets. |
| rvest | Scrapes web data | Extracts data from Wikipedia for shopping mall locations. |
| jsonlite | Processes JSON data | Reads JSON files, such as primary school data retrieved from OneMap’s undocumented API. |
| units | Handles unit conversions | Converts and manages units when calculating proximity distances in meters. |
| matrixStats | Performs fast matrix operations | Efficiently calculates proximity distances in bulk, like computing minimum distances to amenities. |
| tmap | Provides tools for thematic maps | Visualizes spatial distributions of HDB resale prices and proximity features across Singapore. |
| ggplot2 | Creates static graphics | Plots distributions of resale prices and feature relationships. |
| plotly | Creates interactive plots | Visualizes variable importance and residuals in interactive bar and scatter plots. |
| ggstatsplot | Enhances ggplot2 with statistical analysis features | Creates correlation matrices for multicollinearity checks. |
| ggpubr | Simplifies publication-ready visualizations | Arranges multiple ggplot2-based plots for side-by-side comparison. |
| olsrr | Provides tools for regression model diagnostics | Performs diagnostic checks on the multiple linear regression model, including residual analysis. |
| knitr | Formats data into tables for reporting | Generates well-formatted tables to present RMSE and MAE scores of models. |
| kableExtra | Enhances knitr tables | Styles tables for readability and professional presentation. |
| spdep | Analyzes spatial dependence | Conducts Moran’s I Test to evaluate spatial autocorrelation in model residuals. |
| sfdep | Provides spatial statistics with sf objects | Generates distance-based weights for spatial autocorrelation tests, like Moran’s I. |
| ranger | Builds fast, memory-efficient random forests | Constructs a Random Forest model for predicting HDB resale prices. |
| GWmodel | Provides geographically weighted regression models | Fits a Geographically Weighted Random Forest (gwRF) model for spatially varying predictions. |
| SpatialML | Supports machine learning on spatial data | Utilizes spatial machine learning techniques for geographically weighted modeling. |
| Metrics | Calculates common evaluation metrics | Computes RMSE and MAE to evaluate model accuracy on the test set. |
| caret | Provides tools for model training and tuning | Supports model training and feature importance extraction in regression and machine learning contexts. |
Compared to previous Take-home exercises, the data acquisition process for this exercise is more involved, as it requires us to identify the relevant data, locate reliable sources, verify the data’s suitability for our use case, and perform thorough sanity checks.
During this process, I stumbled upon Megan’s work which provided valuable guidance. However, the process was not without challenges. Data sources on official government portals had changed, and the jolene-lim/onemapsgapi library was no longer functional. To address this, I forked the library, made adjustments to handle the latest OneMap API changes, and used this modified version, davzoku/onemapsgapi, for my data acquisition process.
To install the forked development version of onemapsgapi, we can use the devtools library to install the library from the source code directly from GitHub:
For other libraries, we can use pacman as per usual:
pacman::p_load(dotenv, sf, onemapsgapi, httr, tidyverse, xml2, rvest, jsonlite, units, matrixStats, tmap, ggplot2, plotly, ggstatsplot, ggpubr, olsrr, knitr, kableExtra, spdep, sfdep, ranger, GWmodel, SpatialML, Metrics, caret)

5 Model Factors
To build the model, we will examine the following factors as recommended in the assignment task. Note that after studying relevant materials, such as HDB Annual Reports, and checking the availability of relevant data, I have included additional factors on top of the recommended list. These additional factors are marked with (NEW).
5.1 Structural Factors
- Area of the unit
- Floor level
- Remaining lease
- Age of the unit
- Main Upgrading Program (MUP), Home Improvement Program (HIP) completed
5.2 Locational Factors
5.2.1 Accessibility
- Proximity to CBD
- Proximity to MRT
- Number of bus stops within 350m
5.2.2 Recreational and Lifestyle
- Proximity to park
- Proximity to food courts/hawker centres
- Proximity to shopping malls
- Proximity to supermarkets
- Proximity to SportsSG Sports Centre (NEW)
  - Accessibility to sports facilities may be important to residents who lead an active lifestyle, making flats closer to sports centers more attractive and potentially boosting their resale prices.
- Proximity to Community Club (NEW)
  - Community clubs offer recreational, educational, and social services, enhancing the quality of life for nearby residents. This added convenience can positively impact resale prices, as buyers may value proximity to community amenities.
- Proximity to National Library (NEW)
  - Libraries provide educational resources and spaces for community activities. Being close to a national library can be an attractive feature for families and individuals who value learning and community engagement, potentially increasing the appeal of nearby flats.
5.2.3 Educational and Family Services
- Proximity to good primary schools
- Number of kindergartens within 350m
- Number of childcare centres within 350m
- Number of primary schools within 1km
5.2.4 Healthcare and Eldercare
- Proximity to eldercare facilities
- No. of CHAS clinics within 350m (NEW)
  - CHAS (Community Health Assist Scheme) clinics provide subsidized healthcare services, making them important for families and elderly residents. Proximity to these clinics can increase a flat’s desirability for those who prioritize affordable and accessible healthcare, potentially influencing resale prices.
6 Data Sources
The table below lists each dataset, its source, and a brief description:
| Dataset | Source | Description |
|---|---|---|
| Master Plan 2019 Subzone Boundary | data.gov.sg | Geospatial boundaries for Singapore’s planning subzones in 2019. |
| Pre-Schools Location | data.gov.sg | Location data for pre-schools in Singapore. |
| HDB Resale Flat Prices | data.gov.sg | Transaction data for resale prices of HDB flats. |
| MRT and LRT Station Exits | LTA DataMall | Geolocation of MRT and LRT station exits. |
| Bus Stops | LTA DataMall | Geographic location of bus stops across Singapore. |
| Hawker Centres | data.gov.sg | Location data for hawker centers, contributing to lifestyle and accessibility factors. |
| Supermarkets | data.gov.sg | Locations of supermarkets in Singapore. |
| CHAS Clinics | data.gov.sg | Locations of CHAS clinics offering subsidized healthcare. |
| Community Clubs | OneMap | Location data of community clubs. |
| Libraries | OneMap | Locations of national and regional libraries. |
| SportsSG Sports Centres | OneMap | Locations of SportsSG facilities. |
| Eldercare Facilities | OneMap | Geolocation of eldercare facilities, important for assessing accessibility to senior services. |
| Primary Schools | OneMap | Locations of primary schools to analyze proximity to educational institutions. |
| Parks | OneMap | Geospatial data for parks in Singapore. |
| Shopping Malls | Wikipedia | Manually collected data on shopping malls in Singapore. |
| HDB MUP and HIP Programs | HDB | Data on HDB’s Main Upgrading Program (MUP) and Home Improvement Program (HIP), relevant for resale value analysis due to upgrades. |
| Top 20 Primary Schools | Various sources | List of top primary schools in Singapore |
| National Parks | OneMap | Location data of national parks. |
7 Data Collection and Geocoding
This section outlines how I collect datasets through APIs or manual methods, and how aspatial datasets are converted into geospatial datasets via geocoding.
7.1 Geocoding (resale)
For the data acquisition process, I’ll start by retrieving geospatial information for the relevant resale flats. This involves using the OneMap API to fetch latitude and longitude coordinates based on specified addresses.
Referencing code from In-Class Exercise 10, we’ll filter for 5 ROOM resale transactions from Jan 2023 to Dec 2023 and Jul 2024 to Sep 2024. This way, we reduce the number of OneMap API calls by retrieving only the data we need.
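The filter itself can be sketched on toy data (base R shown here; the real pipeline reads the HDB resale CSV and uses dplyr, but the month logic is the same, since "YYYY-MM" strings sort lexicographically):

```r
# Toy resale records; `month` follows the "YYYY-MM" convention of the
# HDB resale dataset, so plain string comparison respects date order.
resale <- data.frame(
  month     = c("2022-12", "2023-01", "2023-12", "2024-03", "2024-07"),
  flat_type = c("5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM")
)
keep <- resale$flat_type == "5 ROOM" &
  ((resale$month >= "2023-01" & resale$month <= "2023-12") |
   (resale$month >= "2024-07" & resale$month <= "2024-09"))
resale[keep, "month"]  # "2023-01" "2023-12" "2024-07"
```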
# sanity check that filter is as intended
distinct_months <- resale %>%
distinct(month) %>%
arrange(month)
distinct_months

Next, we will tidy the data by creating new columns:
- address: combines block and street_name to form a complete address.
- remaining_lease_yr: extracts the remaining lease years as an integer.
- remaining_lease_mth: extracts the remaining lease months as an integer.
resale_tidy <- resale %>%
mutate(address = paste(block,street_name)) %>%
mutate(remaining_lease_yr = as.integer(
str_sub(remaining_lease, 0, 2)))%>%
mutate(remaining_lease_mth = as.integer(
str_sub(remaining_lease, 9, 11)))

Next, we generate a sorted list of unique addresses from the filtered dataset, which will be used to retrieve geographical coordinates:
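A minimal sketch of this de-duplication step (toy addresses for illustration; the real list is built from resale_tidy$address):

```r
# Deduplicate and sort the addresses to geocode, so each address is
# sent to the OneMap API exactly once.
addresses <- c("406 ANG MO KIO AVE 10",
               "108 ANG MO KIO AVE 4",
               "406 ANG MO KIO AVE 10")  # duplicate block
add_list <- sort(unique(addresses))
add_list  # "108 ANG MO KIO AVE 4" "406 ANG MO KIO AVE 10"
```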
We are expected to make 3329 calls to the OneMap API for geocoding. To streamline the process, we will use a helper function, get_coords(), to perform geocoding.
Code
get_coords <- function(add_list) {
url <- "https://onemap.gov.sg/api/common/elastic/search"
found <- data.frame()
not_found <- data.frame()
for (i in seq_along(add_list)) {
address <- add_list[i]
# verbose debug
# print(i)
query <- list('searchVal' = address, 'returnGeom' = 'Y',
'getAddrDetails' = 'Y', 'pageNum' = '1')
res <- GET(url, query = query)
if (content(res)$found != 0) {
# Retrieve and process the geospatial data
tmp_df <- data.frame(content(res))[4:13]
tmp_df$address <- address # Add address to the found data
found <- rbind(found, tmp_df)
} else {
# Add to not_found if no data is found
not_found <- rbind(not_found, data.frame(address = address))
}
}
# Return a list containing both found and not_found dataframes
list(found = found, not_found = not_found)
}

The output of get_coords() is a list of found and not_found data frames. Successful API calls, where coordinates are retrieved, are stored in found, while unsuccessful calls (where no data was returned) are recorded in not_found.
coord_results <- get_coords(add_list)
found <- coord_results[["found"]]

Remember to review the not_found dataframe to catch any API failures or missing data.
We will save the found data frame for future reference if needed:
write_rds(found, "data/rds/found.rds")
found <- read_rds("data/rds/found.rds")

Next, we will tidy the columns to a simpler format for easy referencing.
Code
found_tidy <- found %>%
select(results.BLK_NO, results.ROAD_NAME, results.POSTAL, results.X, results.Y, address) %>%
rename(
POSTAL = results.POSTAL,
XCOORD = results.X,
YCOORD = results.Y,
BLK_NO = results.BLK_NO,
ROAD_NAME = results.ROAD_NAME
)

Using the address fields, we can join resale_tidy with found_tidy.
resale_geocoded = left_join(
resale_tidy, found_tidy,
by = c('address' = 'address'))

Next, we will convert the resale_geocoded tibble data frame to sf:
resale_geocoded_sf <- st_as_sf(resale_geocoded,
coords = c("XCOORD",
"YCOORD"),
crs=3414)

Then, we check for overlapping point features:
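One way to count coincident points can be sketched as below (base-R toy example using duplicated() on a coordinate matrix; on the real sf object this would be sum(duplicated(st_coordinates(resale_geocoded_sf)))):

```r
# Toy coordinate matrix: rows are projected (X, Y) points; two flats in
# the same block geocode to the same building point.
coords <- matrix(c(28000, 39000,
                   28000, 39000,
                   29500, 40200),
                 ncol = 2, byrow = TRUE)
# duplicated() on a matrix flags repeated rows
sum(duplicated(coords))  # 1 overlapping point
```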
There are 7679 overlapping points. We will use st_jitter() of the sf package to shift the point features by up to 5m to avoid overlapping point features.
resale_geocoded_sf_jit <- resale_geocoded_sf %>%
st_jitter(amount = 5)

8 Self-Sourcing & More Geocoding
Some of the factors listed above are not readily available via API calls or government portals and require manual data sourcing.
8.1 Shopping Malls
For shopping mall data, I referred to Wikipedia and manually converted it into a CSV file. While the Wikipedia page was largely accurate, I corrected minor discrepancies based on my knowledge.

shopping_malls_raw <- read_csv("data/raw_data/shopping_malls.csv")

Since this dataset was manually curated, I performed a sanity check for duplicates:
duplicates_mall_name <- shopping_malls_raw %>%
filter(duplicated(mall_name))
duplicates_mall_name

Next, I created a unique, sorted list of mall names for geocoding:
Then, we will reuse the get_coords() function to geocode our malls dataset:
coord_results_malls <- get_coords(addr_list_malls)
found_malls <- coord_results_malls[["found"]] %>%
rename(x = results.X, y = results.Y) %>%
select(address, x, y)
write_rds(found_malls, "data/rds/found_malls.rds")

Remember to review the not_found dataframe to catch any API failures or missing data.
8.2 HDB MUP, HIP Program
For HDB’s MUP and HIP program data, I used HDB’s Upgrading/Estate Renewal Programmes Webservice. This data was manually compiled, and only completed HIP and MUP projects were included. Additionally, only HDB blocks that appear in resale_geocoded_sf_jit (the resale blocks of interest) were considered.

hdb_prog <- read_csv("data/raw_data/hdb_mup_hip.csv") %>%
mutate(address = paste(block,street_name))

Since the dataset is manually curated, let’s check for duplicates as a sanity check.
duplicate_hdb_prog <- hdb_prog %>%
filter(duplicated(address))
duplicate_hdb_prog

Since hdb_prog is a subset of resale_geocoded_sf_jit, I reused found_tidy from earlier, avoiding additional API calls.
hdb_prog_geocoded <- hdb_prog %>%
left_join(found_tidy %>% select(address, XCOORD, YCOORD), by = "address") %>%
rename(x = XCOORD, y = YCOORD)

To confirm completeness, I checked for any missing coordinates:
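The check can be sketched as follows (toy vector; on the real data this would inspect hdb_prog_geocoded$x and hdb_prog_geocoded$y):

```r
# Toy coordinate vector with one failed join (NA after left_join)
x <- c(28000.5, NA, 29510.2)
sum(is.na(x))  # 1 row missing coordinates
```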
Finally, I saved the geocoded HDB program data:
write_rds(hdb_prog_geocoded, "data/rds/hdb_prog_geocoded.rds")

8.3 Data Acquisition from OneMap
In this section, we’ll use the davzoku/onemapsgapi package, which interfaces with the OneMap API. We will mainly use get_theme API to retrieve geospatial data for specific locational factors.
(Optional) To explore available themes, you can use search_themes(token) to select themes relevant to your study. In the code below, I’ve pre-selected themes that appear relevant based on their theme names.
Note that theme names may change over time, so verification is recommended.
OneMap API requires an API token. Do not commit or share this token publicly.
Use a .env file and include it in .gitignore. For example, I have a .env.example file. Copy .env.example to .env and paste in your API token.
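For reference, the .env file is a plain key=value file; a minimal version might look like the following (the value shown is a placeholder; the variable name TOKEN matches the Sys.getenv("TOKEN") call used in the code):

```
TOKEN=your_onemap_api_token_here
```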
Code
load_dot_env(file=".env")
token <- Sys.getenv("TOKEN")
# get list of available themes
available_themes <- search_themes(token)
# cc
communityclubs_tbl <- get_theme(token, "communityclubs")
communityclubs_sf <- st_as_sf(communityclubs_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(communityclubs_sf, "data/rds/communityclubs_sf.rds")
# lib
libraries_tbl <- get_theme(token, "libraries")
libraries_sf <- st_as_sf(libraries_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(libraries_sf, "data/rds/libraries_sf.rds")
# sports
sportsg_tbl <- get_theme(token, "ssc_sports_facilities")
sportsg_sf <- st_as_sf(sportsg_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(sportsg_sf, "data/rds/sportsg_sf.rds")
# elder
eldercare_tbl <- get_theme(token, "eldercare")
eldercare_sf <- st_as_sf(eldercare_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(eldercare_sf, "data/rds/eldercare_sf.rds")
# child
childcare_tbl <- get_theme(token, "childcare")
childcare_sf <- st_as_sf(childcare_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(childcare_sf, "data/rds/childcare_sf.rds")
# kinder
kindergarten_tbl <- get_theme(token, "kindergartens")
kindergartens_sf <- st_as_sf(kindergarten_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(kindergartens_sf, "data/rds/kindergartens_sf.rds")
# park
natparks_tbl <- get_theme(token, "nationalparks")
natparks_sf <- st_as_sf(natparks_tbl, coords=c("Lng", "Lat"), crs=4326)
write_rds(natparks_sf, "data/rds/natparks_sf.rds")

From the OneMap API, we were able to retrieve the following data:
- community clubs
- libraries
- SportSG sports centres
- eldercare facilities
- childcare facilities
- kindergartens
- parks
8.3.1 Data Acquisition from OneMap (in the wild)
Acquiring primary school data requires additional steps beyond standard API calls to OneMap. Although OneMap provides a web-based mapping application for public visualization of geospatial data, extracting this data requires additional steps.
By using browser developer tools to examine network activity, we can identify the data source for primary schools as https://www.onemap.gov.sg/omapp/getAllPriSchools. This endpoint is undocumented in the OneMap API, suggesting it might be managed by a separate team or system. For comparison, OneMap’s documented APIs follow this format: https://www.onemap.gov.sg/api/public/themesvc/checkThemeStatus
See the image below for details.
After inspecting the JSON payload, I found duplicate columns and a row of null values at the start. To clean the data, I removed these unnecessary columns and dropped the null row to retain only the essential information:
pri_sch_tbl <- fromJSON("data/raw_data/getAllPriSchools_payload.json")[["SearchResults"]] %>%
select(-PageCount, -HSE_BLK_NUM, -SCH_POSTAL_CODE, -SCH_ROAD_NAME, -HYPERLINK, -MOREINFO, -LATITUDE, -LONGITUDE, -GEOMETRY) %>% # remove dup columns
slice(-1) # remove first null rows
glimpse(pri_sch_tbl)

Next, I converted the cleaned data into an sf data frame using the X and Y coordinates, setting the coordinate reference system to SVY21 (EPSG:3414). This geospatial format enables easier spatial analysis and visualization.
pri_sch_sf_3414 <- st_as_sf(pri_sch_tbl, coords=c("SCH_X_ADDR", "SCH_Y_ADDR"), crs=3414)
write_rds(pri_sch_sf_3414, "data/rds/pri_sch_sf_3414.rds")

8.3.1.1 Good Primary Schools
Numerous articles and studies highlight the impact of proximity to top primary schools on HDB prices. For example, Living Near A Popular Primary School: The Data On HDB Prices Within 1KM May Surprise You explores this topic.
After reviewing multiple top primary school lists from sources like Creative Campus, Joyous Learning, and Math Nuggets, I compiled a top 20 list:
Code
top_20_primary_schools <- c(
"Methodist Girls' School (Primary)",
"Tao Nan School",
"Ai Tong School",
"Holy Innocents' Primary School",
"CHIJ St. Nicholas Girls' School",
"Admiralty Primary School",
"St. Joseph's Institution Junior",
"Catholic High School",
"Anglo-Chinese School (Junior)",
"Chongfu School",
"Kong Hwa School",
"St. Hilda's Primary School",
"Anglo-Chinese School (Primary)",
"Nan Chiau Primary School",
"Nan Hua Primary School",
"Nanyang Primary School",
"Pei Hwa Presbyterian Primary School",
"Kuo Chuan Presbyterian Primary School",
"Rulang Primary School",
"Singapore Chinese Girls' Primary School"
)

After converting these school names to uppercase for case consistency, I filtered pri_sch_sf_3414 to retain only the top 20 schools and saved this filtered dataset:
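A sketch of this filter on toy names (the exact school-name column in pri_sch_sf_3414 depends on the OneMap payload, so treat the vectors below as placeholders):

```r
# The OneMap payload stores school names in uppercase, so we uppercase
# our curated list before matching.
top_20 <- c("Tao Nan School", "Ai Tong School")
sch_names <- c("TAO NAN SCHOOL", "SOME OTHER SCHOOL", "AI TONG SCHOOL")
good <- sch_names[sch_names %in% toupper(top_20)]
good  # "TAO NAN SCHOOL" "AI TONG SCHOOL"
```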
9 Import Geospatial Data & Pre-processing
In this section, we will import the geospatial datasets obtained directly from government portals, along with those prepared in earlier sections.
Whenever possible, we will import geospatial datasets and transform them to the SVY21 coordinate reference system (CRS), represented by EPSG code 3414. This ensures that all datasets maintain a consistent CRS, facilitating accurate spatial analysis and alignment across layers.
9.1 Import Data
Master Plan Subzone 2019
mpsz_sf <- st_read(dsn = "data/raw_data", layer = "MPSZ-2019") %>%
st_transform(crs = 3414)

Reading layer `MPSZ-2019' from data source
`/Users/walter/code/isss626/isss626-gaa/Take-home_Ex/Take-home_Ex03/data/raw_data'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
MRT & LRT
train_sf <- st_read(dsn = "data/raw_data", layer = "Train_Station_Exit_Layer") %>%
st_transform(crs = 3414)

Bus Stop
bus_sf <- st_read(dsn = "data/raw_data", layer = "BusStop") %>%
st_transform(crs = 3414)

Hawker Centres
hawker_sf <- st_read("data/raw_data/HawkerCentresKML.kml") %>%
st_transform(crs = 3414)

Supermarkets
supermarket_sf <- st_read("data/raw_data/SupermarketsKML.kml") %>%
st_transform(crs = 3414)

CHAS Clinics
CHAS_sf <- st_read("data/raw_data/CHASClinics.kml") %>%
st_transform(crs = 3414)

Resale
resale_geocoded_sf_jit_sf <- read_rds("data/rds/resale_geocoded_sf_jit.rds")
resale_geocoded_sf_train_jit_sf <- read_rds("data/rds/resale_geocoded_sf_train_jit.rds")
resale_geocoded_sf_test_jit_sf <- read_rds("data/rds/resale_geocoded_sf_test_jit.rds")

Shopping Malls

malls <- read_rds("data/rds/found_malls.rds")
malls_sf <- st_as_sf(malls, coords = c("x", "y"), crs = 3414)

HDB Upgrade Programs (MUP, HIP)
hdb_upgrade <- read_rds("data/rds/hdb_prog_geocoded.rds")
hdb_upgrade_sf <- st_as_sf(hdb_upgrade, coords = c("x", "y"), crs = 3414)

Primary Schools & Good Primary Schools
sch_sf <- read_rds("data/rds/pri_sch_sf_3414.rds")
good_sch_sf <- read_rds("data/rds/pri_sch_sf_3414_top_20.rds")

Community Clubs
communityclubs_sf <- read_rds("data/rds/communityclubs_sf.rds")
communityclubs_sf <- st_transform(communityclubs_sf, crs=3414)

Libraries
libraries_sf <- read_rds("data/rds/libraries_sf.rds")
libraries_sf <- st_transform(libraries_sf, crs=3414)

Sports Complex
sportsg_sf <- read_rds("data/rds/sportsg_sf.rds")
sportsg_sf <- st_transform(sportsg_sf, crs=3414)

Eldercare
eldercare_sf <- read_rds("data/rds/eldercare_sf.rds")
eldercare_sf <- st_transform(eldercare_sf, crs=3414)

Kindergarten
kindergarten_sf <- read_rds("data/rds/kindergartens_sf.rds")
kindergarten_sf <- st_transform(kindergarten_sf, crs=3414)

Childcare
childcare_sf <- read_rds("data/rds/childcare_sf.rds")
childcare_sf <- st_transform(childcare_sf, crs=3414)

Parks
natparks_sf <- read_rds("data/rds/natparks_sf.rds")
natparks_sf <- st_transform(natparks_sf, crs=3414)

Then, let’s make a list to manage these datasets for subsequent steps:
loc_fac_sfs <- list(
resale_geocoded_jit = resale_geocoded_sf_jit_sf,
malls = malls_sf,
hdb_upgrade = hdb_upgrade_sf,
sch = sch_sf,
good_sch = good_sch_sf,
communityclubs = communityclubs_sf,
libraries = libraries_sf,
sportsg = sportsg_sf,
eldercare = eldercare_sf,
kindergarten = kindergarten_sf,
childcare = childcare_sf,
natparks = natparks_sf,
train = train_sf,
bus = bus_sf,
hawker = hawker_sf,
supermarket = supermarket_sf,
CHAS = CHAS_sf
)

9.2 Data Pre-processing
For each dataset, we will perform the following checks to ensure consistency and accuracy in our geospatial analysis:
- Confirm that dimensions are in XY format only and remove any Z-dimensions if present using st_zm().
- Check for invalid geometries and correct them.
During the data import process, we observed that a few datasets contained Z-dimensions, which we removed for consistency.
hawker_sf <- st_zm(hawker_sf, drop=TRUE, what = "ZM")
supermarket_sf <- st_zm(supermarket_sf, drop=TRUE, what = "ZM")
CHAS_sf <- st_zm(CHAS_sf, drop=TRUE, what = "ZM")

From our lessons, we learned that mpsz_sf contains some invalid geometries. We’ll address this by applying st_make_valid().
From manual inspection, I observed that bus_sf includes bus stops located outside of Singapore, such as LARKIN TER in Malaysia. To ensure data consistency, I’ll validate all spatial datasets in loc_fac_sfs, filtering them to include only points within Singapore’s boundaries.
Code
loc_fac_sfs_within_sg <- list()
for (name in names(loc_fac_sfs)) {
sf_object <- loc_fac_sfs[[name]]
# Filter points that are within Singapore boundary (mpsz_sf)
sf_object_within <- sf_object[
apply(st_within(sf_object, mpsz_sf, sparse = FALSE), 1, any),
]
loc_fac_sfs_within_sg[[name]] <- sf_object_within
}

After comparing loc_fac_sfs with loc_fac_sfs_within_sg, I found that five bus stops and one CHAS clinic were excluded, confirming these were located outside Singapore.
To ensure all datasets are compatible, we perform a final check on the coordinate reference system (CRS) for each item:
Code
# additional sanity checking all datasets before proceeding
for (i in loc_fac_sfs_within_sg) {
print(st_crs(i))
}

Finally, loc_fac_sfs_within_sg is saved for future use:
write_rds(loc_fac_sfs_within_sg, "data/rds/loc_fac_sfs_within_sg.rds")

10 Visualization & Verification
In this section, we’ll plot each spatial dataset to examine the distribution of features, verify that they fall within Singapore’s boundaries, and assess whether their locations make sense.
loc_fac_sfs_within_sg <- read_rds("data/rds/loc_fac_sfs_within_sg.rds")

Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE)
for (name in names(loc_fac_sfs_within_sg)) {
sf_object <- loc_fac_sfs_within_sg[[name]]
map <- tm_shape(mpsz_sf) +
tm_polygons("REGION_N", alpha=0.4) +
tm_shape(sf_object) +
tm_dots(col = 'red', size = 0.02) +
tm_layout(
main.title = name,
main.title.position = "center"
)
print(map)
}
(One map is rendered per dataset in loc_fac_sfs_within_sg.)
During this visual inspection, we observed some point features in unexpected locations, such as supermarkets and CHAS clinics in remote areas like Tuas and Changi Airport. Manual verification suggests these are valid entities. Judging from the distribution of resale_geocoded_jit, however, these distant point features are unlikely to be relevant or influential when computing proximity or facility counts around resale flats, so we will keep that in mind.
11 Feature Engineering
In this section, we will convert variables to usable formats (e.g., ordinal levels for floor range), calculate proximity to features like the CBD, and count nearby facilities, to better capture the relevant structural and locational factors.
11.1 Floor Level
- Extract unique storey_range values and sort them.
- Create an ordinal variable storey_order based on the sorted storey_range.
resale_geocoded_jit_sf <- loc_fac_sfs_within_sg[["resale_geocoded_jit"]]
storeys <- sort(unique(resale_geocoded_jit_sf$storey_range))
storeys

 [1] "01 TO 03" "04 TO 06" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
[7] "19 TO 21" "22 TO 24" "25 TO 27" "28 TO 30" "31 TO 33" "34 TO 36"
[13] "37 TO 39" "40 TO 42" "43 TO 45"
storey_order <- 1:length(storeys)
storey_range_order <- data.frame(storeys, storey_order)
resale_geocoded_jit_sf <- left_join(resale_geocoded_jit_sf, storey_range_order, by = c("storey_range" = "storeys"))

11.2 Remaining Lease (Months)
Combine remaining_lease_yr and remaining_lease_mth into a single column, remaining_lease, calculated in months:
resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>%
mutate(
remaining_lease = replace_na(remaining_lease_yr, 0) * 12 +
replace_na(remaining_lease_mth, 0)
)
summary(resale_geocoded_jit_sf$remaining_lease)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
551.0 823.5 911.0 914.9 1055.0 1145.0
11.3 Age of Unit (Months)
Reverse-engineer the age based on a typical 99-year lease, subtracting the remaining_lease from 99 years in months:
resale_geocoded_jit_sf <- resale_geocoded_jit_sf %>%
mutate(age = 99 * 12 - remaining_lease)
summary(resale_geocoded_jit_sf$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
43.0 133.0 277.0 273.1 364.5 637.0
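As a quick sanity check on the reverse-engineered age, the medians above are consistent with the 99-year lease assumption:

```r
# A full 99-year lease is 99 * 12 = 1188 months; subtracting the median
# remaining lease (911 months) should give the median age (277 months).
99 * 12 - 911  # 277
```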
11.4 HDB Upgrade Program (HIP / MUP)
Load the HDB upgrade data and join it with resale_geocoded_jit_sf, so that each property carries a type field indicating whether it has undergone the Home Improvement Program (HIP) or the Main Upgrading Program (MUP); this field can then be one-hot encoded into hip and mup flags.
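The flag construction itself can be sketched on toy data (assuming the type column records which programme, if any, a block completed; NA means none):

```r
# Toy `type` column after the join
type <- c("HIP", "MUP", NA, "HIP")
# One-hot flags; the !is.na() guard keeps NA rows as 0 rather than NA
hip <- as.integer(!is.na(type) & type == "HIP")
mup <- as.integer(!is.na(type) & type == "MUP")
hip  # 1 0 0 1
mup  # 0 1 0 0
```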
hdb_upgrade <- read_rds("data/rds/hdb_prog_geocoded.rds") %>%
select(address, type)
resale_geocoded_jit_sf <- left_join(resale_geocoded_jit_sf, hdb_upgrade, by = c('address' = 'address'))

11.5 Proximity Distance Calculation
The proximity() function is referenced from Megan’s work.
Code
proximity <- function(df1, df2, varname) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units()
df1[,varname] <- rowMins(dist_matrix)
return(df1)
}

From this site, we can consider the Downtown Core as the CBD location.
lat <- 1.287953
lng <- 103.851784
cbd_sf <- data.frame(lat, lng) %>%
st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
st_transform(crs=3414)

The code block below calculates the distance from each HDB resale point to the nearest facility of each type, such as community clubs, libraries, parks, and malls, generating 11 new proximity features in resale_geocoded_jit_sf:
Code
resale_geocoded_jit_sf <-
proximity(resale_geocoded_jit_sf, cbd_sf, "PROX_CBD") %>%
proximity(., loc_fac_sfs_within_sg[["communityclubs"]], "PROX_CC") %>%
proximity(., loc_fac_sfs_within_sg[["libraries"]], "PROX_LIB") %>%
proximity(., loc_fac_sfs_within_sg[["sportsg"]], "PROX_SPORTS") %>%
proximity(., loc_fac_sfs_within_sg[["eldercare"]], "PROX_ELDERCARE") %>%
proximity(., loc_fac_sfs_within_sg[["natparks"]], "PROX_PARK") %>%
proximity(., loc_fac_sfs_within_sg[["hawker"]], "PROX_HAWKER") %>%
proximity(., loc_fac_sfs_within_sg[["train"]], "PROX_TRAIN") %>%
proximity(., loc_fac_sfs_within_sg[["supermarket"]], "PROX_SMKT") %>%
proximity(., loc_fac_sfs_within_sg[["good_sch"]], "PROX_GD_SCH") %>%
proximity(., loc_fac_sfs_within_sg[["malls"]], "PROX_MALLS")

11.6 Nearby Facilities Count
The nearby facilities count function, num_radius(), is from Megan’s work.
Code
num_radius <- function(df1, df2, varname, radius) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units() %>%
as.data.frame()
df1[,varname] <- rowSums(dist_matrix <= radius)
return(df1)
}

The following code calculates the number of nearby facilities within specified distances for each HDB resale point, adding five new features to resale_geocoded_jit_sf:
Code
resale_geocoded_jit_sf <-
num_radius(resale_geocoded_jit_sf, loc_fac_sfs_within_sg[["kindergarten"]], "NUM_KINDERGARTEN_350M", 350) %>%
num_radius(., loc_fac_sfs_within_sg[["childcare"]], "NUM_CHILDCARE_350M", 350) %>%
num_radius(., loc_fac_sfs_within_sg[["bus"]], "NUM_BUS_STOP_350M", 350) %>%
num_radius(., loc_fac_sfs_within_sg[["sch"]], "NUM_PRI_SCH_1KM", 1000) %>%
num_radius(., loc_fac_sfs_within_sg[["CHAS"]], "NUM_CLINIC_350M", 350)

Let’s save the fully feature-engineered dataset for future reference:
write_rds(resale_geocoded_jit_sf, "data/rds/resale_geocoded_jit_full_sf.rds")
resale_geocoded_jit_full_sf <- read_rds("data/rds/resale_geocoded_jit_full_sf.rds")

Next, we save a lite version containing only essential columns, removing unnecessary data fields like block numbers, street names, and postal codes:
resale_geocoded_jit_ft_sf <- resale_geocoded_jit_sf %>%
select(-block, -street_name, -storey_range, -lease_commence_date, -remaining_lease_yr, -remaining_lease_mth, -BLK_NO, -ROAD_NAME, -POSTAL, -type)

write_rds(resale_geocoded_jit_ft_sf, "data/rds/resale_geocoded_jit_ft_sf.rds")

12 Train-Test Split
We will now split the dataset into training and testing subsets based on the date range, and save each subset.

The datasets are saved as follows:

- resale_geocoded_jit_ft_sf: the full dataset with jitter applied and engineered features
- resale_geocoded_jit_ft_train_sf: the training subset
- resale_geocoded_jit_ft_test_sf: the testing subset
resale_geocoded_jit_ft_train_sf <- resale_geocoded_jit_ft_sf %>%
filter(month >= "2023-01" & month <= "2023-12")
resale_geocoded_jit_ft_test_sf <- resale_geocoded_jit_ft_sf %>%
filter(month >= "2024-07" & month <= "2024-09")
write_rds(resale_geocoded_jit_ft_train_sf, "data/rds/resale_geocoded_jit_ft_train_sf.rds")
write_rds(resale_geocoded_jit_ft_test_sf, "data/rds/resale_geocoded_jit_ft_test_sf.rds")

13 Exploratory Data Analysis
To prevent information leakage and reduce bias, we’ll conduct exploratory data analysis (EDA) on the training dataset only.
resale_train_sf <- read_rds("data/rds/resale_geocoded_jit_ft_train_sf.rds")

Let's observe a quick summary of the train set:
summary(resale_train_sf)

month town flat_type floor_area_sqm
Length:5852 Length:5852 Length:5852 Min. : 99.0
Class :character Class :character Class :character 1st Qu.:112.0
Mode :character Mode :character Mode :character Median :116.0
Mean :117.7
3rd Qu.:122.0
Max. :153.0
flat_model remaining_lease resale_price address
Length:5852 Min. : 551.0 Min. : 418000 Length:5852
Class :character 1st Qu.: 826.0 1st Qu.: 590000 Class :character
Mode :character Median : 914.0 Median : 650000 Mode :character
Mean : 917.5 Mean : 685044
3rd Qu.:1063.0 3rd Qu.: 740000
Max. :1145.0 Max. :1480000
geometry storey_order age hip
POINT :5852 Min. : 1.000 Min. : 43.0 Min. :0.0000
epsg:3414 : 0 1st Qu.: 2.000 1st Qu.:125.0 1st Qu.:0.0000
+proj=tmer...: 0 Median : 3.000 Median :274.0 Median :0.0000
Mean : 3.435 Mean :270.5 Mean :0.1077
3rd Qu.: 4.000 3rd Qu.:362.0 3rd Qu.:0.0000
Max. :15.000 Max. :637.0 Max. :1.0000
mup PROX_CBD PROX_CC PROX_LIB
Min. :0.0000 Min. : 1610 Min. : 2.156 Min. : 59.7
1st Qu.:0.0000 1st Qu.:11422 1st Qu.: 340.759 1st Qu.: 771.0
Median :0.0000 Median :13822 Median : 525.043 Median :1157.5
Mean :0.0299 Mean :13199 Mean : 540.417 Mean :1185.3
3rd Qu.:0.0000 3rd Qu.:16055 3rd Qu.: 712.111 3rd Qu.:1593.5
Max. :1.0000 Max. :19575 Max. :1992.808 Max. :2835.2
PROX_SPORTS PROX_ELDERCARE PROX_PARK PROX_HAWKER
Min. : 68.67 Min. : 2.445 Min. : 62.4 Min. : 50.77
1st Qu.: 904.80 1st Qu.: 399.111 1st Qu.: 465.8 1st Qu.: 437.81
Median :1396.17 Median : 686.745 Median : 708.4 Median : 764.98
Mean :1551.37 Mean : 874.553 Mean : 787.9 Mean : 869.69
3rd Qu.:2067.69 3rd Qu.:1196.152 3rd Qu.:1008.8 3rd Qu.:1166.53
Max. :4196.13 Max. :3279.141 Max. :2204.6 Max. :2868.70
PROX_TRAIN PROX_SMKT PROX_GD_SCH PROX_MALLS
Min. : 22.55 Min. : 1.953 Min. : 79.89 Min. : 2.607
1st Qu.: 278.85 1st Qu.: 181.593 1st Qu.:1152.81 1st Qu.: 361.147
Median : 480.25 Median : 279.522 Median :1948.38 Median : 546.031
Mean : 561.12 Mean : 300.074 Mean :2138.98 Mean : 626.413
3rd Qu.: 750.55 3rd Qu.: 395.668 3rd Qu.:2732.04 3rd Qu.: 798.139
Max. :2127.62 Max. :1455.808 Max. :7156.54 Max. :2178.680
NUM_KINDERGARTEN_350M NUM_CHILDCARE_350M NUM_BUS_STOP_350M NUM_PRI_SCH_1KM
Min. :0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
1st Qu.:0.000 1st Qu.: 3.000 1st Qu.: 6.000 1st Qu.:2.000
Median :1.000 Median : 5.000 Median : 8.000 Median :3.000
Mean :1.055 Mean : 4.921 Mean : 8.222 Mean :3.319
3rd Qu.:1.000 3rd Qu.: 6.000 3rd Qu.:10.000 3rd Qu.:4.000
Max. :8.000 Max. :22.000 Max. :18.000 Max. :9.000
NUM_CLINIC_350M
Min. : 0.000
1st Qu.: 1.000
Median : 2.000
Mean : 2.828
3rd Qu.: 4.000
Max. :13.000
To have a better sense of the data, we can visualize the distribution of data with a histogram:
Code
fill_col <- "#3498db"
ggplotly(
ggplot(resale_train_sf, aes(x = resale_price)) +
geom_histogram(bins = 30, color="black",fill = fill_col) +
ggtitle("Distribution of Resale Prices") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
)

Observations
The histogram shows the distribution of resale prices for 5-room flats in the training set (Jan - Dec 2023).
Prices span from around SGD 400,000 to nearly SGD 1.5 million, and the distribution is right-skewed, with a tail extending towards prices above $1 million.
We can also visualize the spatial distribution of 5-room resale flat prices using tmap.
Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_sf) +
tm_polygons("REGION_N", alpha=0.4) +
tm_shape(resale_train_sf) +
tm_dots(
col = "resale_price",
alpha = 0.5,
style = "quantile",
size = 0.1,
title = "Resale Price (SGD)"
) +
tm_layout(
outer.margins = c(0.1, 0, 0, 0),
main.title = "Spatial Distribution of Resale Prices for 5-Room Flats",
main.title.position = "center",
main.title.size = 1.5,
legend.position = c("left", "bottom"),
legend.title.size = 1.2,
legend.text.size = 0.8,
frame = TRUE,
inner.margins = c(0.05, 0.05, 0.05, 0.05)
)
Observation
This map shows the spatial distribution of resale prices for 5-room flats in Singapore, with darker shades indicating higher prices.
High resale prices are concentrated in the central and southern downtown areas, as well as in Toa Payoh and Bishan. High resale prices can also be observed in the north-east, in the Punggol area, likely due to the younger age of flats in these locations.
In contrast, the western and northern regions, such as Jurong and Woodlands, offer more affordable resale options.
13.1 Visualizing Structural Factors
To examine the structural characteristics of the dataset, we will categorize and visualize the structural factors in two groups: categorical and numerical.
First, we identify and group the structural factors:
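The grouping code itself is not shown above; based on the plots and observations that follow, it is presumably something along these lines (the exact variable membership is inferred from the later sections, not copied from the original):

```r
# Categorical structural factors: binary upgrading-program flags
struct_factors_cat <- c("mup", "hip")

# Numerical structural factors
struct_factors_num <- c("floor_area_sqm", "storey_order",
                        "remaining_lease", "age")
```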
Then, we create a plot_factors function to visualize these factors. This function takes in the dataset, a list of factors to plot, and layout specifications. Based on whether the factor is categorical or numerical, it creates either bar charts or histograms.
Code
plot_factors <- function(data, factors, ncol = 2, nrow = 2, is_categorical = FALSE) {
# create plots depending on categorical or numerical
plots <- lapply(factors, function(var) {
if (is_categorical) {
ggplot(data, aes_string(x = var)) +
geom_bar(color = "black", fill = fill_col) +
ggtitle(paste("Dist. of", var))
} else {
ggplot(data, aes_string(x = var)) +
geom_histogram(bins = 20, color = "black", fill = fill_col) +
ggtitle(paste("Dist. of", var))
}
})
ggarrange(plotlist = plots, ncol = ncol, nrow = nrow)
}

Code
plot_factors(resale_train_sf, struct_factors_cat, ncol = 2, nrow = 1, is_categorical = TRUE)
Observations
The bar charts show the distribution of the mup and hip variables across the dataset:
- MUP (Main Upgrading Program):
  - The majority of records do not have the MUP completed, as indicated by the large count for mup = 0.
  - Only a small fraction of records have MUP completed (mup = 1), suggesting that this upgrading program is relatively rare in the dataset.
- HIP (Home Improvement Program):
  - Similar to MUP, most records do not have the HIP completed, with a large count for hip = 0.
  - However, the count of hip = 1 (indicating HIP completion) is slightly higher than that of MUP.
This distribution indicates that most of the resale flats in the dataset have not undergone these upgrading programs. Let's keep a mental note of this.
Code
plot_factors(resale_train_sf, struct_factors_num, ncol = 2, nrow = 2, is_categorical = FALSE)
Observations
The figures above show the distribution of the numerical structural factors for the train set:
- Floor Area (sqm):
  - The majority of HDB units have a floor area between 100 and 120 sqm (which is expected).
- Storey Order:
  - Note that the x-axis for this histogram is the storey range.
  - Interestingly, most units transacted are located in the lower storey ranges.
- Remaining Lease:
  - The distribution of remaining lease values reveals two prominent peaks: one around 900 months (approximately 75 years remaining) and another near 1150 months (around 95 years remaining).
  - This suggests that a significant portion of the 5-room resale transactions in 2023 involve either units that have just met the Minimum Occupation Period (MOP) or units that are around 20 years old.
- Age:
  - The age distribution is inversely related to the remaining lease.
13.2 Visualizing Locational Factors
Similarly, we group the locational factors and plot their distributions:

loc_factors <- c("PROX_CBD", "PROX_CC", "PROX_LIB", "PROX_SPORTS",
"PROX_ELDERCARE", "PROX_PARK", "PROX_HAWKER", "PROX_TRAIN", "PROX_SMKT",
"PROX_GD_SCH", "PROX_MALLS", "NUM_KINDERGARTEN_350M", "NUM_CHILDCARE_350M",
"NUM_BUS_STOP_350M", "NUM_PRI_SCH_1KM", "NUM_CLINIC_350M")
plot_factors(resale_train_sf, loc_factors, ncol = 4, nrow = 4, is_categorical = FALSE)
Observations
Proximity Variables (PROX)
General Skewness: Most proximity-based location factors (e.g., PROX_CC, PROX_ELDERCARE, PROX_PARK) show right-skewed distributions.
Proximity to CBD (PROX_CBD): This variable is slightly left-skewed, indicating that most units are further from the CBD, with fewer closer to the city center.
Common Amenities: Proximity to facilities such as Community Centers, Libraries, Malls and Sports Centers generally peaks around 500–1000 meters, showing moderate accessibility for most flats.
Number of Facility Counts within Radius (NUM)
NUM_KINDERGARTEN_350M, NUM_CLINIC_350M: These factors show a strong right skew, with most units having between 0 and 3 facilities within 350 meters.
NUM_CHILDCARE_350M: The distribution shows a slight right skew, with most units having between 0-5 childcare centers within 350 meters.
NUM_BUS_STOP_350M: The distribution is more balanced, with counts mostly ranging between 5-10 bus stops within 350 meters. This indicates relatively good access to bus stops, which is essential for public transport connectivity.
13.3 Visualizing Train-Test Split
To understand the distribution of towns in the training and test datasets and check whether the data split is stratified, we accumulate the sales counts by town and compare them between the two splits.
Code
resale_test_sf <- read_rds("data/rds/resale_geocoded_jit_ft_test_sf.rds")
train_counts <- as.data.frame(table(resale_train_sf$town))
colnames(train_counts) <- c("town", "train")
test_counts <- as.data.frame(table(resale_test_sf$town))
colnames(test_counts) <- c("town", "test")
town_counts <- merge(train_counts, test_counts, by = "town", all = TRUE)
town_counts <- full_join(train_counts, test_counts, by = "town") %>%
replace_na(list(train = 0, test = 0)) %>%
arrange(desc(train + test))
t_t_split_plot <- plot_ly(town_counts, y = ~town) %>%
add_trace(x = ~train, name = "Train", type = "bar", orientation = "h") %>%
add_trace(x = ~test, name = "Test", type = "bar", orientation = "h") %>%
layout(
title = list(text = "Town Distribution in Train vs Test Splits"),
xaxis = list(title = "Count"),
yaxis = list(title = "Town", categoryorder = "array", categoryarray = ~town),
barmode = "group"
)
t_t_split_plot

Observations
Overall, the ratio between train and test data appears relatively consistent across towns.
Sengkang and Punggol have the highest number of data points in both the train and test sets, suggesting a higher volume of resale transactions or data availability in these towns. This is reasonable, as they are young towns with newer flats, many of which have recently reached MOP.
Central Area and Bukit Timah have the fewest data points, which might suggest that a smaller number of HDB flats are available in the resale market there, or that they are simply too expensive.
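To back the visual check with numbers, the per-town test share can be computed directly from the merged counts. A self-contained sketch with illustrative counts (the town names are real, the numbers are made up; the same arithmetic applies to the town_counts data frame above):

```r
counts <- data.frame(
  town  = c("SENGKANG", "PUNGGOL", "BUKIT TIMAH"),
  train = c(800, 700, 10),
  test  = c(200, 180, 3)
)

# Share of each town's transactions that fall in the test window;
# a roughly constant share across towns indicates an even split
counts$test_share <- counts$test / (counts$train + counts$test)
round(counts$test_share, 2)  # 0.20 0.20 0.23
```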
14 Multicollinearity Check
Multicollinearity can affect the stability and interpretability of a regression model. To identify potential multicollinearity, we will use ggcorrmat() of ggstatsplot to plot a correlation matrix to check if there are pairs of highly correlated independent variables.
Variables with correlations above 0.5 are potential candidates for removal or further analysis, depending on their impact on the model.
resale_nogeo <- resale_train_sf %>%
st_drop_geometry()
corr_list <- c("floor_area_sqm", "remaining_lease", "storey_order", "age",
"hip", "mup", "PROX_CBD", "PROX_CC", "PROX_LIB", "PROX_SPORTS",
"PROX_ELDERCARE", "PROX_PARK", "PROX_HAWKER", "PROX_TRAIN", "PROX_SMKT",
"PROX_GD_SCH", "PROX_MALLS", "NUM_KINDERGARTEN_350M", "NUM_CHILDCARE_350M",
"NUM_BUS_STOP_350M", "NUM_PRI_SCH_1KM", "NUM_CLINIC_350M")
ggstatsplot::ggcorrmat(resale_nogeo, corr_list)
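As a numeric complement to the matrix plot, pairs exceeding the 0.5 threshold can be listed programmatically. A self-contained sketch on toy data (in the actual analysis, `cor()` would be applied to `resale_nogeo[, corr_list]`; the toy columns below mimic the age/remaining_lease relationship):

```r
set.seed(42)
toy <- data.frame(remaining_lease = rnorm(100))
toy$age          <- -toy$remaining_lease  # perfectly negatively correlated
toy$storey_order <- rnorm(100)            # unrelated noise

cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA      # keep each pair only once

idx <- which(abs(cm) > 0.5, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, 1]],
                         var2 = colnames(cm)[idx[, 2]],
                         corr = cm[idx])
high_pairs  # one row: age ~ remaining_lease, corr = -1
```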
Observations
From the matrix, we observe that:
- Age and Remaining Lease are perfectly negatively correlated (-1), meaning they carry the same information in opposite directions. We will retain only one of these variables (removing Age) to avoid redundancy.
15 OLS Multiple Linear Regression Model
We will move on to build our first model. We fit a multiple linear regression model to predict resale_price based on various structural and locational factors. The ols_regress function from olsrr is used to evaluate the model.
resale_mlr <- lm(formula = resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
PROX_ELDERCARE+ PROX_PARK+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M+
NUM_BUS_STOP_350M+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
data = resale_train_sf)
olsrr::ols_regress(resale_mlr)

Model Summary
--------------------------------------------------------------------------
R 0.864 RMSE 71124.152
R-Squared 0.747 MSE 5077734242.025
Adj. R-Squared 0.746 Coef. Var 10.402
Pred R-Squared 0.745 AIC 147412.478
MAE 53128.738 SBC 147565.992
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------
Regression 8.739137e+13 21 4.161494e+12 819.557 0.0000
Residual 2.960319e+13 5830 5077734242.025
Total 1.169946e+14 5851
------------------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------------------------
(Intercept) -297812.438 28784.167 -10.346 0.000 -354240.085 -241384.792
floor_area_sqm 6715.327 178.883 0.339 37.540 0.000 6364.650 7066.003
remaining_lease 581.129 10.043 0.600 57.866 0.000 561.441 600.816
storey_order 19907.859 503.598 0.280 39.531 0.000 18920.621 20895.098
hip 21187.338 3586.425 0.046 5.908 0.000 14156.614 28218.061
mup 15307.484 6383.041 0.018 2.398 0.017 2794.356 27820.611
PROX_CBD -21.028 0.326 -0.585 -64.459 0.000 -21.668 -20.389
PROX_CC -29.645 4.315 -0.054 -6.871 0.000 -38.103 -21.187
PROX_LIB -31.941 2.150 -0.121 -14.857 0.000 -36.156 -27.727
PROX_SPORTS 11.558 1.494 0.069 7.735 0.000 8.629 14.488
PROX_ELDERCARE -4.585 1.859 -0.021 -2.466 0.014 -8.230 -0.941
PROX_PARK -1.503 2.534 -0.004 -0.593 0.553 -6.470 3.464
PROX_HAWKER -17.719 2.228 -0.069 -7.953 0.000 -22.087 -13.352
PROX_TRAIN -37.591 2.988 -0.099 -12.579 0.000 -43.449 -31.732
PROX_SMKT 22.426 6.755 0.025 3.320 0.001 9.183 35.669
PROX_GD_SCH -4.684 0.787 -0.046 -5.950 0.000 -6.227 -3.140
PROX_MALLS -13.517 3.046 -0.035 -4.437 0.000 -19.488 -7.545
NUM_KINDERGARTEN_350M 8438.655 1025.462 0.064 8.229 0.000 6428.370 10448.941
NUM_CHILDCARE_350M -4579.307 500.711 -0.077 -9.146 0.000 -5560.886 -3597.728
NUM_BUS_STOP_350M -106.480 342.324 -0.002 -0.311 0.756 -777.561 564.602
NUM_PRI_SCH_1KM -14934.129 742.115 -0.174 -20.124 0.000 -16388.949 -13479.308
NUM_CLINIC_350M 6868.982 527.562 0.101 13.020 0.000 5834.764 7903.200
------------------------------------------------------------------------------------------------------------------
Code
summary(resale_mlr)

Observations
The R-Squared value indicates that the model explains approximately 74.7% of the variance in resale prices, suggesting a strong fit.
Based on the high p-values (> 0.05), the following variables are not statistically significant and may not meaningfully contribute to the model:
- PROX_PARK (p = 0.553)
- NUM_BUS_STOP_350M (p = 0.756)
Removing these variables can simplify the model without sacrificing predictive power. We’ll further examine variable importance in later steps.
Now, we will build another model without the variables mentioned above:
resale_mlr2 <- lm(formula = resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
PROX_ELDERCARE
#+ PROX_PARK
+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
# +NUM_BUS_STOP_350M
+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
data = resale_train_sf)
olsrr::ols_regress(resale_mlr2)

Model Summary
--------------------------------------------------------------------------
R 0.864 RMSE 71126.953
R-Squared 0.747 MSE 5076392689.525
Adj. R-Squared 0.746 Coef. Var 10.401
Pred R-Squared 0.745 AIC 147408.939
MAE 53120.877 SBC 147549.104
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------
Regression 8.738904e+13 19 4.599423e+12 906.042 0.0000
Residual 2.960552e+13 5832 5076392689.525
Total 1.169946e+14 5851
------------------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------------------------
(Intercept) -297334.543 28237.914 -10.530 0.000 -352691.326 -241977.761
floor_area_sqm 6705.168 177.163 0.339 37.847 0.000 6357.862 7052.474
remaining_lease 580.840 9.980 0.599 58.202 0.000 561.276 600.403
storey_order 19894.393 503.072 0.280 39.546 0.000 18908.185 20880.601
hip 21452.539 3553.934 0.047 6.036 0.000 14485.510 28419.568
mup 15672.686 6358.226 0.019 2.465 0.014 3208.205 28137.168
PROX_CBD -21.102 0.308 -0.587 -68.576 0.000 -21.705 -20.498
PROX_CC -29.653 4.284 -0.054 -6.922 0.000 -38.051 -21.255
PROX_LIB -31.894 2.121 -0.120 -15.039 0.000 -36.051 -27.736
PROX_SPORTS 11.460 1.473 0.069 7.781 0.000 8.573 14.348
PROX_ELDERCARE -4.361 1.803 -0.020 -2.418 0.016 -7.896 -0.826
PROX_HAWKER -17.539 2.211 -0.068 -7.934 0.000 -21.872 -13.205
PROX_TRAIN -37.834 2.962 -0.100 -12.771 0.000 -43.641 -32.026
PROX_SMKT 22.892 6.719 0.026 3.407 0.001 9.720 36.063
PROX_GD_SCH -4.703 0.785 -0.046 -5.989 0.000 -6.242 -3.164
PROX_MALLS -13.515 3.022 -0.035 -4.473 0.000 -19.439 -7.592
NUM_KINDERGARTEN_350M 8485.827 1022.839 0.065 8.296 0.000 6480.684 10490.970
NUM_CHILDCARE_350M -4600.690 494.407 -0.078 -9.305 0.000 -5569.911 -3631.469
NUM_PRI_SCH_1KM -14988.484 736.682 -0.175 -20.346 0.000 -16432.654 -13544.313
NUM_CLINIC_350M 6839.424 525.413 0.100 13.017 0.000 5809.419 7869.428
------------------------------------------------------------------------------------------------------------------
Observations
Compared to the earlier model, the R-squared value remains essentially the same (0.747). By removing the variables that were not statistically significant in the previous model, we have simplified the model without sacrificing explanatory power.
We will save this model for further analysis:
write_rds(resale_mlr2, "data/rds/resale_mlr2.rds")

15.1 Test for Linear Regression Assumptions
Referencing In-Class Exercise 07, we will run a few more tests on the resale_mlr2 model.
15.1.1 Multicollinearity Check with VIF
Variance Inflation Factor (VIF) analysis is conducted to detect the presence of multicollinearity among the predictors. High VIF values suggest redundancy, indicating that some predictors might need to be removed.
vif <- performance::check_collinearity(resale_mlr2)
kable(vif,
caption = "Variance Inflation Factor (VIF) Results") %>%
kable_styling(font_size = 18)

| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| floor_area_sqm | 1.844959 | 1.779020 | 1.916478 | 1.358292 | 0.5420176 | 0.5217904 | 0.5621072 |
| remaining_lease | 2.444561 | 2.348634 | 2.547311 | 1.563509 | 0.4090715 | 0.3925709 | 0.4257794 |
| storey_order | 1.153047 | 1.123093 | 1.190291 | 1.073800 | 0.8672674 | 0.8401310 | 0.8903985 |
| hip | 1.398739 | 1.355403 | 1.447358 | 1.182683 | 0.7149297 | 0.6909139 | 0.7377877 |
| mup | 1.351977 | 1.311067 | 1.398267 | 1.162746 | 0.7396574 | 0.7151708 | 0.7627373 |
| PROX_CBD | 1.689703 | 1.631571 | 1.753185 | 1.299886 | 0.5918201 | 0.5703906 | 0.6129062 |
| PROX_CC | 1.386885 | 1.344162 | 1.434911 | 1.177661 | 0.7210405 | 0.6969075 | 0.7439580 |
| PROX_LIB | 1.477513 | 1.430134 | 1.530110 | 1.215530 | 0.6768132 | 0.6535478 | 0.6992352 |
| PROX_SPORTS | 1.799846 | 1.736173 | 1.869027 | 1.341584 | 0.5556030 | 0.5350378 | 0.5759794 |
| PROX_ELDERCARE | 1.589665 | 1.536586 | 1.647995 | 1.260819 | 0.6290633 | 0.6067980 | 0.6507934 |
| PROX_HAWKER | 1.703971 | 1.645120 | 1.768190 | 1.305362 | 0.5868646 | 0.5655502 | 0.6078584 |
| PROX_TRAIN | 1.412188 | 1.368159 | 1.461482 | 1.188355 | 0.7081212 | 0.6842369 | 0.7309093 |
| PROX_SMKT | 1.316993 | 1.277916 | 1.361564 | 1.147603 | 0.7593056 | 0.7344496 | 0.7825240 |
| PROX_GD_SCH | 1.363063 | 1.321576 | 1.409903 | 1.167503 | 0.7336416 | 0.7092687 | 0.7566722 |
| PROX_MALLS | 1.406714 | 1.362968 | 1.455734 | 1.186050 | 0.7108764 | 0.6869387 | 0.7336932 |
| NUM_KINDERGARTEN_350M | 1.399690 | 1.356305 | 1.448357 | 1.183085 | 0.7144440 | 0.6904376 | 0.7372971 |
| NUM_CHILDCARE_350M | 1.601359 | 1.547688 | 1.660290 | 1.265448 | 0.6244695 | 0.6023044 | 0.6461250 |
| NUM_PRI_SCH_1KM | 1.707058 | 1.648052 | 1.771436 | 1.306544 | 0.5858033 | 0.5645137 | 0.6067771 |
| NUM_CLINIC_350M | 1.363316 | 1.321816 | 1.410168 | 1.167611 | 0.7335054 | 0.7091352 | 0.7565349 |
To visualize the results:
plot(vif) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Observations
The plot above shows the Variance Inflation Factor (VIF) values for each predictor in the resale_mlr2 model. The VIF quantifies the extent of correlation between one predictor and the other predictors: the higher the value, the greater the multicollinearity. All predictors have VIF values below 5, suggesting that multicollinearity is not a significant issue in this model.

remaining_lease and floor_area_sqm have the highest VIFs but are still well within acceptable limits.
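For intuition, the VIF of a predictor is 1/(1 − R²_j), where R²_j comes from regressing that predictor on all the others. A quick base-R demonstration on toy data (variable names here are illustrative, not from our dataset):

```r
set.seed(1234)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent noise

# VIF of x2: regress it on the remaining predictors, then invert (1 - R^2)
r2_x2  <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)
vif_x2  # inflated (roughly 2.5-3 here), but below the usual cutoff of 5
```

performance::check_collinearity(), used above, computes this same quantity for every term of a fitted model at once.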
15.1.2 Test for Non-Linearity
In multiple linear regression, it is important to test the assumption of linearity and additivity in the relationship between the dependent and independent variables.
To test the linearity assumption, we use the ols_plot_resid_fit() function.
ols_plot_resid_fit(resale_mlr2)
Observations
Most of our data points are clustered around the 0 line, with a few outliers. However, these deviations are within an acceptable range, allowing us to conclude that the relationships between the dependent and independent variables are linear.
15.1.3 Test for Normality Assumption
We can assess the normality of residuals with a histogram:
ols_plot_resid_hist(resale_mlr2)
Observations:
The figure reveals that the residuals of resale_mlr2 resemble a normal distribution.
For a formal test, we can use ols_test_normality():
Unlike the class exercise, we need to sample a subset of residuals, as some of these tests (notably Shapiro-Wilk) have a sample-size limit of 5000. To ensure reproducibility, a seed is set before sampling.
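The sampling step looks roughly like this (a self-contained sketch: simulated values stand in for `residuals(resale_mlr2)`, since the fitted model is not re-run here):

```r
set.seed(1234)
res_all <- rnorm(5852)            # stand-in for residuals(resale_mlr2)
res_sub <- sample(res_all, 5000)  # shapiro.test() accepts at most 5000 values

st <- shapiro.test(res_sub)
st$p.value
```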
-----------------------------------------------
Test Statistic pvalue
-----------------------------------------------
Shapiro-Wilk 0.9706 0.0000
Kolmogorov-Smirnov 0.0463 0.0000
Cramer-von Mises 417.5645 0.0000
Anderson-Darling 21.1864 0.0000
-----------------------------------------------
Observations:
The summary table above reveals that the p-values of the four tests are far smaller than the alpha value of 0.05. Hence we reject the null hypothesis and infer that there is statistical evidence that the residuals are not normally distributed.
15.1.4 Test for Spatial Autocorrelation
Since the hedonic model involves geographically referenced data, it is important to visualize the residuals and test for spatial autocorrelation.
First, we export the residuals from the hedonic pricing model into a data frame and join it with the spatial dataframe.
mlr_res <- as.data.frame(resale_mlr2$residuals)
resale_res_sf <- cbind(resale_train_sf,
resale_mlr2$residuals) %>%
rename(`MLR_RES` = `resale_mlr2.residuals`)

Using tmap, we visualize the spatial distribution of the residuals:
Code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_sf) +
tm_polygons("REGION_N",alpha = 0.4) +
tm_shape(resale_res_sf) +
tm_dots(
col = "MLR_RES",
alpha = 0.6,
style = "quantile",
size = 0.2,
title = "MLR Residuals"
) +
tm_layout(
outer.margins = c(0.1, 0, 0, 0),
main.title = "Multiple Linear Regression Residuals",
main.title.position = "center",
main.title.size = 1.5,
legend.position = c("left", "bottom"),
legend.title.size = 1.2,
legend.text.size = 0.8,
frame = TRUE,
inner.margins = c(0.05, 0.05, 0.05, 0.05)
)
Observations
The plot above shows indications of spatial autocorrelation, which we need to validate statistically.

To verify this observation, we will perform the Moran's I test.
\(H_0\): The residuals are randomly distributed (i.e., spatially stationary).
\(H_1\): The residuals are spatially non-stationary.
First, we will compute a k-nearest-neighbours (k = 6) spatial weight matrix using the st_knn() and st_weights() functions of sfdep.
resale_res_sf <- resale_res_sf %>%
mutate(nb = st_knn(geometry, k=6,
longlat = FALSE),
wt = st_weights(nb,
style = "W"),
.before = 1)

Next, global_moran_perm() of sfdep is used to perform the global Moran permutation test. We set a seed for reproducibility.
set.seed(1234)
mlr_moran_perm <- global_moran_perm(resale_res_sf$MLR_RES,
resale_res_sf$nb,
resale_res_sf$wt,
alternative = "two.sided",
nsim = 99)

Then we save the result for future analysis:
write_rds(mlr_moran_perm, "data/rds/mlr_moran_perm.rds")

mlr_moran_perm <- read_rds("data/rds/mlr_moran_perm.rds")
mlr_moran_perm
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.71167, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Observations:
The Global Moran’s I test for residual spatial autocorrelation shows that it’s p-value is less than 2.2e-16 which is less than the alpha value of 0.05. Hence, we will reject the null hypothesis that the residuals are randomly distributed.
Since the Observed Global Moran I = 0.71167 which is greater than 0, we can infer than the residuals resemble cluster distribution.
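For intuition on the statistic itself: with weights \(w_{ij}\), Moran's I is \(I = \frac{n}{\sum_{i,j} w_{ij}} \cdot \frac{\sum_{i,j} w_{ij} z_i z_j}{\sum_i z_i^2}\), where \(z_i\) are deviations from the mean. A minimal base-R computation on a toy four-point chain (illustrative only; sfdep handles this for the real data):

```r
moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Four locations on a line; neighbours = adjacent locations, row-standardised
W <- matrix(c(0,  1,  0,  0,
              .5, 0, .5,  0,
              0, .5,  0, .5,
              0,  0,  1,  0), nrow = 4, byrow = TRUE)

moran_i(c(1, 2, 3, 4), W)  # 0.4 -> smooth gradient, positive autocorrelation
moran_i(c(1, 4, 1, 4), W)  # -1  -> alternating values, negative autocorrelation
```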
16 Random Forest (RF)
Referencing In-Class Exercise 08, we proceed to build our RF model with ranger.
Since the SpatialML package is based on the ranger package, coordinate data must be prepared before calibration.
resale_test_sf <- read_rds("data/rds/resale_geocoded_jit_ft_test_sf.rds")

coords <- st_coordinates(resale_sf)
coords_train <- st_coordinates(resale_train_sf)
coords_test <- st_coordinates(resale_test_sf)
write_rds(coords, "data/rds/coords.rds")
write_rds(coords_train, "data/rds/coords_train.rds")
write_rds(coords_test, "data/rds/coords_test.rds")

Additionally, the geometry field is removed:
train_nogeo <- resale_train_sf %>%
st_drop_geometry()

16.1 Calibrating RF Model
When calibrating the Random Forest (RF) model, we reduce num.trees from the typical 500 to 50 to decrease model complexity and computation time.
Additionally, we set importance = "impurity" to calculate variable importance based on the reduction in node impurity, allowing us to understand which predictors most strongly impact the model’s decisions. This will be analyzed collectively later on.
Note: we set a seed for reproducibility.
set.seed(1234)
rf <- ranger(resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
PROX_ELDERCARE
#+ PROX_PARK
+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
# +NUM_BUS_STOP_350M
+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
data=train_nogeo,
num.trees = 50,
importance="impurity")

As usual, we save the model for downstream model evaluation:
write_rds(rf, "data/rds/rf.rds")

rf <- read_rds("data/rds/rf.rds")
rf

Ranger result
Call:
ranger(resale_price ~ floor_area_sqm + remaining_lease + storey_order + hip + mup + PROX_CBD + PROX_CC + PROX_LIB + PROX_SPORTS + PROX_ELDERCARE + PROX_HAWKER + PROX_TRAIN + PROX_SMKT + PROX_GD_SCH + PROX_MALLS + NUM_KINDERGARTEN_350M + NUM_CHILDCARE_350M + NUM_PRI_SCH_1KM + NUM_CLINIC_350M, data = train_nogeo, num.trees = 50, importance = "impurity")
Type: Regression
Number of trees: 50
Sample size: 5852
Number of independent variables: 19
Mtry: 4
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 1724866962
R squared (OOB): 0.9137379
Observations
Compared to the resale_mlr2 model, rf has a higher out-of-bag R-squared (0.91), indicating that it explains a larger portion of the variance in resale prices.
Further model evaluation will be explored later on.
17 Geographically Weighted Random Forest (gwRF)
Geographically Weighted Random Forest (gwRF) is an extension of the traditional Random Forest (RF) model that accounts for spatial heterogeneity in the data.
While a standard RF model generates predictions based solely on global patterns across all observations, gwRF introduces localized weights based on spatial coordinates. This allows it to capture variations in relationships between predictors and the target variable across different geographic locations.
17.1 Calibrating Geographically Weighted Random Forest (gwRF)
Before we calibrate the gwRF model, we first calculate the optimal bandwidth for gwRF model. This step is essential to account for local variations in the relationship between predictors and the target variable (resale_price) across different geographic locations.
In this case, we use cross-validation to select the optimal bandwidth, which helps ensure that the model generalizes well by finding a balance between model fit and complexity.
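For reference, the Gaussian kernel used here weights each neighbouring observation as \(w = \exp(-\tfrac{1}{2}(d/b)^2)\), where \(d\) is the distance and \(b\) the bandwidth (with adaptive = TRUE, \(b\) adapts to the distance of the k-th nearest neighbour rather than being a fixed value). A standalone illustration of the decay, with illustrative distances:

```r
# Gaussian kernel weight as a function of distance d and bandwidth bw
gauss_w <- function(d, bw) exp(-0.5 * (d / bw)^2)

d <- c(0, 500, 1000, 2000)  # meters, illustrative
w <- round(gauss_w(d, bw = 1000), 3)
w  # 1.000 0.882 0.607 0.135 -- weight halves well before 2 bandwidths out
```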
bw_adaptive <- bw.gwr(resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
PROX_ELDERCARE
#+ PROX_PARK
+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
# +NUM_BUS_STOP_350M
+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
data=resale_res_sf,
approach="CV",
kernel="gaussian",
adaptive=TRUE,
longlat=FALSE)

We save the output for reference:

write_rds(bw_adaptive, "data/rds/bw_adaptive.rds")

bw_adaptive <- read_rds("data/rds/bw_adaptive.rds")
bw_adaptive

[1] 72
Observations
The optimal adaptive bandwidth is 72, meaning each local model is calibrated on the 72 nearest neighbouring observations (with adaptive = TRUE, the bandwidth is a neighbour count rather than a fixed distance).
Then we calibrate a gwRF model using the previously calculated adaptive bandwidth (bw_adaptive) to set the range of influence for each data point in the geographic weighting.
Note: we set a seed for reproducibility.
set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~ floor_area_sqm+ remaining_lease+ storey_order+
hip+ mup+ PROX_CBD+ PROX_CC+ PROX_LIB+ PROX_SPORTS+
PROX_ELDERCARE
#+ PROX_PARK
+ PROX_HAWKER+ PROX_TRAIN+ PROX_SMKT+
PROX_GD_SCH+ PROX_MALLS+ NUM_KINDERGARTEN_350M+ NUM_CHILDCARE_350M
# +NUM_BUS_STOP_350M
+ NUM_PRI_SCH_1KM+ NUM_CLINIC_350M,
dframe=train_nogeo,
bw = bw_adaptive,
ntree = 50,
kernel="adaptive",
verbose = TRUE,
coords=coords_train)
We save the output for reference:
write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")
18 Model Evaluation
In this section, we evaluate three models—Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (GWRF)—using the test dataset. We load each saved model and generate predictions to assess their performance. These models are trained using the same set of variables to ensure a fair comparison.
model_mlr <- read_rds("data/rds/resale_mlr2.rds")
model_rf <- read_rds("data/rds/rf.rds")
model_gwrf <- read_rds("data/rds/gwRF_adaptive.rds")
18.1 Generating Predictions
- Multiple Linear Regression (MLR): We predict resale prices using model_mlr on the test dataset.
resale_test_sf$pred_mlr <- predict(object = model_mlr, newdata = resale_test_sf)
- Random Forest (RF): For RF predictions, we remove geometry data from the test dataset (test_nogeo) to ensure compatibility with model_rf.
test_nogeo <- resale_test_sf %>%
st_drop_geometry()
rf_pred <- predict(model_rf, data = test_nogeo)
resale_test_sf$pred_rf <- rf_pred$predictions
- Geographically Weighted Random Forest (gwRF): The gwRF model requires both the test dataset and coordinates (coords_test) to incorporate spatial variability into predictions.
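Conceptually, predict.grf() blends a local (spatially weighted) forest prediction with the global forest's prediction via the local.w and global.w arguments. A toy sketch of that blending arithmetic (illustrative numbers, not the SpatialML internals):

```r
# Weighted blend of a local and a global prediction; with local_w = 1
# and global_w = 0, only the local prediction is used.
blend_prediction <- function(local_pred, global_pred, local_w, global_w) {
  local_w * local_pred + global_w * global_pred
}

blend_prediction(650000, 600000, local_w = 1,   global_w = 0)    # 650000
blend_prediction(650000, 600000, local_w = 0.5, global_w = 0.5)  # 625000
```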
gwRF_test_data <- cbind(test_nogeo, coords_test)
gwRF_pred <- predict.grf(model_gwrf,
gwRF_test_data,
x.var.name="X",
y.var.name="Y",
local.w=1,
global.w=0)
resale_test_sf$pred_grf <- gwRF_pred
Then we save the output for reference:
write_rds(resale_test_sf, "data/rds/model_eval.rds")
18.2 Visualize Predictions by Model
Building on the previous section, we now visualize the predicted resale prices against the actual resale prices for three different models: Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (gwRF).
Ideally, the points should align along a straight line, indicating that the predictions closely match the actual values. Any deviation from this line suggests that the model is either overestimating or underestimating resale prices in certain ranges.
model_eval <- read_rds("data/rds/model_eval.rds")
Using a custom plot_pred function, we can visualize the scatterplots side by side:
Code
plot_pred <- function(data, x, y, title) {
ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
geom_point() +
ggtitle(title) +
theme(plot.title = element_text(hjust = 0.5))
}
Code
plot_mlr <- plot_pred(model_eval, "pred_mlr", "resale_price", "[MLR] Actual vs Pred")
plot_rf <- plot_pred(model_eval, "pred_rf", "resale_price", "[RF] Actual vs Pred")
plot_grf <- plot_pred(model_eval, "pred_grf", "resale_price", "[gwRF] Actual vs Pred")
ggarrange(plot_mlr, plot_rf, plot_grf, ncol = 3, nrow = 1)
Observations
- Multiple Linear Regression (MLR): The predicted values show a general trend aligning with the actual values, but there is noticeable spread as compared to other models.
- Random Forest (RF): This model produces a more compact scatter, showing better alignment with actual values, particularly in the mid-range prices.
- Geographically Weighted Random Forest (gwRF): The predictions appear similar to RF.
While the visualizations suggest some differences in performance, it is challenging to determine which model is superior from these plots alone. We will further evaluate the models using RMSE and MAE for a clearer assessment.
18.3 Quantitative Metrics Assessment
The code below calculates the RMSE (Root Mean Square Error) and MAE (Mean Absolute Error) for the three models and formats the results neatly using kable.
Code
model_predictions <- list(
MLR = model_eval$pred_mlr,
RF = model_eval$pred_rf,
gwRF = model_eval$pred_grf
)
metrics <- lapply(model_predictions, function(pred) {
c(RMSE = rmse(model_eval$resale_price, pred),
MAE = mae(model_eval$resale_price, pred))
})
metrics_df <- as.data.frame(do.call(rbind, metrics))
kable(metrics_df) %>%
kable_styling(font_size = 18)
|  | RMSE | MAE |
|---|---|---|
| MLR | 96025.16 | 73946.48 |
| RF | 70811.57 | 54771.09 |
| gwRF | 72055.22 | 55628.60 |
Observations
Based on the RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) values, the Random Forest (RF) model performs best for predicting 5-room resale flat prices in the test set (July - September 2024):
- RF has the lowest RMSE and MAE, showing higher accuracy and lower prediction error than both MLR and gwRF.
- Geographically Weighted RF (gwRF) has slightly higher RMSE and MAE, making it the second-best model.
- MLR has the highest RMSE and MAE, performing worst among the three.
This result is somewhat surprising, as gwRF was expected to perform better given its additional geographic information. Let's analyze RMSE and MAE at the town level to investigate further.
We will further investigate by identifying the best-performing model for each town, focusing on RMSE as our primary evaluation metric due to its sensitivity to larger errors, which is crucial for assessing model accuracy in predicting resale prices.
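To see why RMSE is more sensitive to large errors than MAE, consider a toy example with hand-rolled equivalents of the two metrics (both assume the (actual, predicted) argument order):

```r
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

actual <- c(500000, 520000, 510000)
pred_a <- c(510000, 510000, 500000)  # three moderate errors of 10k each
pred_b <- c(500000, 520000, 480000)  # one large error of 30k

mae_manual(actual, pred_a)   # 10000
mae_manual(actual, pred_b)   # 10000  -> identical MAE
rmse_manual(actual, pred_a)  # 10000
rmse_manual(actual, pred_b)  # ~17321 -> RMSE flags the concentrated error
```

Both prediction sets have the same MAE, but RMSE penalises the single large error more heavily, which is exactly the behaviour we want when large mispricings matter most.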
To do so, we create a dataframe metrics_town_df that calculates the RMSE and MAE of each model (MLR, RF, and gwRF) for each town, then identifies the model with the lowest RMSE per town and records it in the best_model_RMSE column.
Code
metrics_town_df <- model_eval %>%
group_by(town) %>%
summarise(
RMSE_MLR = rmse(resale_price, pred_mlr),
MAE_MLR = mae(resale_price, pred_mlr),
RMSE_RF = rmse(resale_price, pred_rf),
MAE_RF = mae(resale_price, pred_rf),
RMSE_gwRF = rmse(resale_price, pred_grf),
MAE_gwRF = mae(resale_price, pred_grf)
) %>%
rowwise() %>%
mutate(
best_model_RMSE = case_when(
RMSE_MLR <= RMSE_RF & RMSE_MLR <= RMSE_gwRF ~ "MLR",
RMSE_RF <= RMSE_MLR & RMSE_RF <= RMSE_gwRF ~ "RF",
TRUE ~ "gwRF"
)
) %>%
ungroup()
table(metrics_town_df$best_model_RMSE)
gwRF MLR RF
10 3 13
From the table above, we can see that RF is the preferred model for 13 towns, with gwRF a close second at 10.
To visualize the results on a map:
Code
metrics_town_df <- metrics_town_df %>%
st_drop_geometry() %>%
mutate(town = ifelse(town == "KALLANG/WHAMPOA", "KALLANG", town)) %>% # to match with mpsz
mutate(town = ifelse(town == "CENTRAL AREA", "DOWNTOWN CORE", town)) # to match with mpsz
tmap_mode("view")
tmap_options(check.and.fix = TRUE)
best_model_by_town <- mpsz_sf %>%
left_join(metrics_town_df, by = c("PLN_AREA_N" = "town"))
tm_shape(best_model_by_town) +
tm_polygons("best_model_RMSE", alpha = 1, title = "Best Model (RMSE)") +
tm_layout(
outer.margins = c(0.1, 0, 0, 0),
main.title = "Best Model for Resale Price Prediction by Town",
main.title.position = "center",
main.title.size = 1.5,
legend.position = c("left", "bottom"),
legend.title.size = 1.2,
legend.text.size = 0.8,
frame = TRUE,
inner.margins = c(0.05, 0.05, 0.05, 0.05)
)
Observations
The map visualizes the best-performing model for predicting resale prices by town, based on RMSE. The distribution of models shows that:
- The Random Forest (RF) model (in purple) is the best fit for many northern and western regions.
- The Geographically Weighted Random Forest (gwRF) model (in green) performs best in central and southeastern areas.
- Interestingly, the Multiple Linear Regression (MLR) model (in yellow) only outperforms in the Hougang area.
18.4 Variable Importance
Another way to understand these models is to study their variable importance. Variable importance quantifies how much a specific variable helps in making accurate predictions of the target variable (dependent variable). Higher importance means that the variable has a stronger influence on the model’s predictions.
To do so, we extract the importance data from the models and visualize them.
For the MLR model, we can use varImp() of caret package to retrieve the importance of variables in the MLR model.
mlr_vi_df <- varImp(model_mlr, scale = FALSE) %>%
mutate(importance = .[[1]]) %>%
rename(variable = 1) %>%
select(importance) %>%
rownames_to_column(var = "variable") %>%
arrange(desc(importance))
For the RF and gwRF models, we can extract the variable.importance data directly. We then sort it by importance in the same format as above.
rf_vi_df <- as.data.frame(model_rf$variable.importance) %>%
mutate(importance = .[[1]]) %>%
select(importance) %>%
rownames_to_column(var = "variable") %>%
arrange(desc(importance))
gwrf_vi_df <- as.data.frame(model_gwrf$Global.Model$variable.importance) %>%
mutate(importance = .[[1]]) %>%
select(importance) %>%
rownames_to_column(var = "variable") %>%
arrange(desc(importance))
To plot the variable importance of each model, we will use a custom function, plot_variable_importance, built with the plotly library:
Code
plot_variable_importance <- function(data, title = "Variable Importance", fill_color = "steelblue") {
plot_ly(
data,
x = ~importance,
y = ~reorder(variable, importance),
type = "bar",
orientation = "h",
marker = list(color = fill_color)
) %>%
layout(
title = title,
xaxis = list(title = "Importance"),
yaxis = list(title = "Variable")
)
}
Code
plot_variable_importance(mlr_vi_df, "[MLR] Variable Importance", "slateblue")
Code
plot_variable_importance(rf_vi_df, "[RF] Variable Importance", "orange")
Code
plot_variable_importance(gwrf_vi_df, "[gwRF] Variable Importance", "seagreen")
Comparison of Variable Importance across Models
- Across all three models (MLR, RF, gwRF), the top predictors are consistently PROX_CBD, remaining_lease, storey_order, and floor_area_sqm, indicating these factors play crucial roles in predicting resale prices.
- In MLR, PROX_CBD and remaining_lease have a more pronounced lead in importance over other variables compared to RF and gwRF.
- To my surprise, PROX_LIB and PROX_HAWKER have higher variable importance in the RF-based models than commonly expected variables like PROX_TRAIN (proximity to train stations) or PROX_MALLS (proximity to malls).
- Variables like mup and hip exhibit very low importance across all models, indicating limited predictive power. This is expected: as observed in the EDA, a significant portion of resale transactions involve recently MOP-ed (5-year-old) flats, whereas HIP/MUP typically applies to flats that are 30 years old or more.
18.5 Visualising Errors
Given the similarity in Variable Importance for the RF-based models (RF and GWRF), we can gain further insights by examining the spatial distribution of prediction errors. This interactive map will help us identify areas where each model performs well or struggles, highlighting any spatial patterns in the errors.
Using the model_eval dataframe, we calculate the error for each model by subtracting the predicted resale prices from the actual resale prices.
The errors for RF and gwRF models are computed as follows:
model_eval <- model_eval %>%
mutate(
error_rf = resale_price - pred_rf,
error_grf = resale_price - pred_grf
)
To create a side-by-side box plot of RF and gwRF errors:
plot_ly(model_eval, y = ~error_rf, type = "box", name = "RF Error") %>%
add_trace(y = ~error_grf, type = "box", name = "gwRF Error") %>%
layout(
title = "Comparison of RF and gwRF Errors",
yaxis = list(title = "Error"),
boxmode = "group" # Places the boxes side by side
)
Observations
The box plot shows that the RF model has a tighter error distribution around zero, with fewer extreme outliers, indicating more consistent predictions.
In comparison, the gwRF model has a wider spread with more outliers, suggesting higher variability and larger errors in certain cases.
Overall, RF appears to provide more stable predictions.
Next, we find and label extreme error rows for each model, so that we can understand them on the interactive map.
error_extremes_df <- bind_rows(
model_eval %>% filter(error_rf == min(error_rf, na.rm = TRUE)) %>% mutate(error_type = "min_rf"),
model_eval %>% filter(error_rf == max(error_rf, na.rm = TRUE)) %>% mutate(error_type = "max_rf"),
model_eval %>% filter(error_grf == min(error_grf, na.rm = TRUE)) %>% mutate(error_type = "min_grf"),
model_eval %>% filter(error_grf == max(error_grf, na.rm = TRUE)) %>% mutate(error_type = "max_grf")
) %>%
select(error_type, everything())
error_extremes_df
Simple feature collection with 4 features and 34 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 20857.34 ymin: 29478.58 xmax: 29217.29 ymax: 38519.5
Projected CRS: SVY21 / Singapore TM
# A tibble: 4 × 35
error_type month town flat_type floor_area_sqm flat_model remaining_lease
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 min_rf 2024-07 CLEMEN… 5 ROOM 123 Improved 714
2 max_rf 2024-09 BUKIT … 5 ROOM 113 Improved 1050
3 min_grf 2024-07 CLEMEN… 5 ROOM 123 Improved 714
4 max_grf 2024-09 ANG MO… 5 ROOM 121 Improved 1046
# ℹ 28 more variables: resale_price <dbl>, address <chr>, geometry <POINT [m]>,
# storey_order <int>, age <dbl>, hip <dbl>, mup <dbl>, PROX_CBD <dbl>,
# PROX_CC <dbl>, PROX_LIB <dbl>, PROX_SPORTS <dbl>, PROX_ELDERCARE <dbl>,
# PROX_PARK <dbl>, PROX_HAWKER <dbl>, PROX_TRAIN <dbl>, PROX_SMKT <dbl>,
# PROX_GD_SCH <dbl>, PROX_MALLS <dbl>, NUM_KINDERGARTEN_350M <dbl>,
# NUM_CHILDCARE_350M <dbl>, NUM_BUS_STOP_350M <dbl>, NUM_PRI_SCH_1KM <dbl>,
# NUM_CLINIC_350M <dbl>, pred_mlr <dbl>, pred_rf <dbl>, pred_grf <dbl>, …
Observations
It is interesting to observe that the minimum errors for both the RF and gwRF models occur for the same flat in Clementi.
To create the interactive point symbol map:
Code
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
error_rf <- tm_shape(mpsz_sf)+
tm_polygons(alpha = 0.1) +
tm_shape(model_eval) +
tm_dots(col = "error_rf",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(11,14))
error_grf<- tm_shape(mpsz_sf)+
tm_polygons(alpha = 0.1) +
tm_shape(model_eval) +
tm_dots(col = "error_grf",
border.col = "gray60",
border.lwd = 1) +
tm_view(set.zoom.limits = c(11,14))
tmap_arrange(error_rf, error_grf,
asp=1, ncol=2,
sync = TRUE)
Observations
Both maps show similar spatial error patterns for the RF and gwRF models. Slight differences in the error magnitudes can be observed, particularly with gwRF showing slightly higher error values in certain areas (Clementi North), as seen in the expanded scale on the right map.
19 Conclusion
In conclusion, this exercise aimed to predict resale prices of 5-room HDB flats in Singapore using various geospatial and aspatial features, ultimately comparing three models: Multiple Linear Regression (MLR), Random Forest (RF), and Geographically Weighted Random Forest (gwRF).
The data acquisition process proved to be a challenging and eye-opening experience. Combining data from multiple web sources and government portals required navigating numerous steps and technical hurdles, from dealing with changing data structures on official portals to geocoding addresses accurately. This process highlighted the importance of knowing where to look and how to handle different data sources, reinforcing the critical role of data sourcing skills in real-world projects.
In the feature engineering phase, we learned to manipulate geospatial data to create relevant variables, such as proximity-based and radius-based features. These features, including proximity to the Central Business District (CBD), schools, parks, and other amenities, added valuable locational context to the dataset. The EDA phase revealed general trends and provided a deeper understanding of the dataset. We also used multicollinearity checks to select variables, ensuring that the model wasn't impacted by redundant predictors.
The model-building process was highly iterative. Initially, we included all variables, then gradually refined the models by removing statistically insignificant ones. For a fair comparison, all models used the same list of variables.
The model evaluation phase involved multiple steps, including prediction visualization and quantitative metrics such as RMSE and MAE. We further assessed performance at the town level to gain insights into model behavior across different regions, which highlighted the strengths and weaknesses of each model. Additionally, we examined variable importance (VI) to understand which factors were most influential in predicting resale prices. Across all models, the top three most important variables were PROX_CBD (proximity to CBD), remaining_lease (years left on the lease), and storey_order (floor level).
- PROX_CBD: Proximity to Singapore's CBD emerged as one of the strongest predictors, reflecting the premium associated with central locations and accessibility to economic and lifestyle hubs. Flats closer to the CBD generally command higher resale values due to their desirability and convenience.
- remaining_lease: This factor represents the number of years left on the flat's 99-year lease. Properties with more remaining lease years tend to have higher resale prices, as buyers value the longevity of their investment.
- storey_order: Floor level also emerged as a key predictor, with higher floors generally commanding a premium, likely due to better views, reduced noise, and a greater sense of privacy. This trend aligns with general buyer preferences for high-floor units in densely populated urban environments.
Among the three models, RF surprisingly performed best, delivering the highest accuracy and lowest prediction error, even outperforming the spatially aware gwRF model. While gwRF was expected to excel due to its ability to account for spatial heterogeneity, RF's performance suggests that the influence of geographic variations might be less pronounced for 5-room flat resale prices within the selected timeframe. Further investigation at the town level showed that RF performed consistently well across most towns, while gwRF was a close second, particularly excelling in central and southeastern areas. For the final comparison, we used an interactive map to visualize prediction errors, highlighting areas where RF and gwRF performed well or struggled, revealing spatial trends and unique town characteristics affecting model accuracy.
Overall, the use of RF and gwRF models demonstrated the power of machine learning in capturing complex relationships to predict resale prices accurately. For aspiring homeowners, real estate analysts, and policymakers, these insights offer a valuable perspective on the factors driving HDB resale prices across Singapore, setting the stage for future work in geographically weighted modeling and housing market analysis.